{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": true,
    "editable": true
   },
   "source": [
    "<h1> Structured data prediction using Cloud AI Platform </h1>\n",
    "\n",
    "This notebook illustrates:\n",
    "<ol>\n",
    "<li> Create a BigQuery Dataset and Google Cloud Storage Bucket \n",
    "<li> Export from BigQuery to CSVs in GCS\n",
    "<li> Training on Cloud AI Platform\n",
    "<li> Deploy trained model\n",
    "</ol>"
   ]
  },
  {
    "cell_type": "code",
    "metadata": {
      "id": "Nny3m465gKkY",
      "colab_type": "code",
      "colab": {}
    },
    "source": [
      "!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst"
    ],
    "execution_count": null,
    "outputs": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
"Collecting google-cloud-bigquery==3.4.1\n",
"Downloading google_cloud_bigquery-3.4.1-py2.py3-none-any.whl (169 kB)\n",
     "|████████████████████████████████| 169 kB 4.7 MB/s eta 0:00:01\n",

"Requirement already satisfied:  six in /home/jupyter/.local/lib/python3.7/site-packages (from google-cloud-bigquery==3.4.1)\n",
"Requirement already satisfied: google-auth in /usr/local/lib/python3.7/site-packages (from google-cloud-bigquery==3.4.1)\n",
"Requirement already satisfied: google-resumable-media in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery==3.4.1)\n",
"Requirement already satisfied: google-cloud-core in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery==3.4.1)\n",
"Requirement already satisfied: protobuf in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery==3.4.1)\n",
"Requirement already satisfied: google-api-core in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery==3.4.1)\n",
"Requirement already satisfied: cachetools in /usr/local/lib/python3.7/dist-packages(from google-cloud-bigquery==3.4.1)\n",
"Requirement already satisfied: rsa in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery==3.4.1)\n",
"Requirement already satisfied: pyasn1-modules in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery==3.4.1)\n",
"Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery==3.4.1)\n",
"Requirement already satisfied: googleapis-common-protos in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery==3.4.1)\n",
"Requirement already satisfied: pyasn1 in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery==3.4.1)\n",
"Requirement already satisfied: certifi in /usr/local/lib/python3.7/dist-packages (from google-cloud-bigquery==3.4.1)\n",
"Installing collected packages: google-resumable-media, google-cloud-bigquery\n",
"\u001b[33mWARNING: You are using pip version 20.1; however, version 20.2.3 is available."
     ]
    }
   ],
   "source": [
    "!pip install --user google-cloud-bigquery==3.4.1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note**: Restart your kernel to use updated packages."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Kindly ignore the deprecation warnings and incompatibility errors related to google-cloud-storage**."
   ]
  },
{
 "cell_type": "markdown",
 "metadata": {
  "colab_type": "text",
  "id": "hJ7ByvoXzpVI"
 },
 "source": [
  "## Set up environment variables and load necessary libraries"
 ]
},
{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "Set environment variables so that we can use them throughout the entire notebook. We will be using our project name for our bucket, so you only need to change your project and region."
 ]
},
{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "**Note:** Replace the <code>REGION</code> with the associated region mentioned in the lab resource panel."
 ]
},
{
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true,
    "deletable": true,
    "editable": true,
    "jupyter": {
     "outputs_hidden": true
    }
   },
   "outputs": [],
   "source": [
    "# change these to try this notebook out\n",
    "BUCKET = 'cloud-training-demos-ml' # Replace with the your bucket name\n",
    "PROJECT = 'cloud-training-demos' # Replace with your project-id\n",
    "REGION = ' ' # Replace with region mentioned in the lab resource panel"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from google.cloud import bigquery"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "os.environ[\"PROJECT\"] = PROJECT\n",
    "os.environ[\"BUCKET\"] = BUCKET\n",
    "os.environ[\"REGION\"] = REGION\n",
    "os.environ[\"TFVERSION\"] = \"2.3\"\n",
    "os.environ[\"PYTHONVERSION\"] = \"3.7\""
   ]
  },

  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "export PROJECT=$(gcloud config list project --format \"value(core.project)\")\n",
    "echo \"Your current GCP Project Name is: \"$PROJECT"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "L0-vOB4y2BJM"
   },
   "source": [
    "## The source dataset\n",
    "\n",
    "Our dataset is hosted in [BigQuery](https://cloud.google.com/bigquery/). The CDC's Natality data has details on US births from 1969 to 2008 and is a publically available dataset, meaning anyone with a GCP account has access. Click [here](https://console.cloud.google.com/bigquery?project=bigquery-public-data&p=publicdata&d=samples&t=natality&page=table) to access the dataset.\n",
    "\n",
    "The natality dataset is relatively large at almost 138 million rows and 31 columns, but simple to understand. `weight_pounds` is the target, the continuous value we’ll train a model to predict."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a BigQuery Dataset and Google Cloud Storage Bucket \n",
    "\n",
    "A BigQuery dataset is a container for tables, views, and models built with BigQuery ML. Let's create one called __babyweight__. We'll do the same for a GCS bucket for our project too."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "# Create a BigQuery dataset for babyweight if it doesn't exist\n",
    "datasetexists=$(bq ls -d | grep -w babyweight)\n",
    "\n",
    "if [ -n \"$datasetexists\" ]; then\n",
    "    echo -e \"BigQuery dataset already exists, let's not recreate it.\"\n",
    "\n",
    "else\n",
    "    echo \"Creating BigQuery dataset titled: babyweight\"\n",
    "    \n",
    "    bq --location=US mk --dataset \\\n",
    "        --description \"Babyweight\" \\\n",
    "        $PROJECT:babyweight\n",
    "    echo \"Here are your current datasets:\"\n",
    "    bq ls\n",
    "fi\n",
    "    \n",
    "## Create GCS bucket if it doesn't exist already...\n",
    "exists=$(gcloud storage ls | grep -w gs://${BUCKET}/)\n",    "\n",
    "if [ -n \"$exists\" ]; then\n",
    "    echo -e \"Bucket exists, let's not recreate it.\"\n",
    "    \n",
    "else\n",
    "    echo \"Creating a new GCS bucket.\"\n",
    "    gcloud storage buckets create gs://${BUCKET} --location=${REGION}\n",    "    echo \"Here are your current buckets:\"\n",
    "    gcloud storage ls\n",    "fi"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "b2TuS1s9vREL"
   },
   "source": [
    "## Create the training and evaluation data tables\n",
    "\n",
    "Since there is already a publicly available dataset, we can simply create the training and evaluation data tables using this raw input data. First we are going to create a subset of the data limiting our columns to `weight_pounds`, `is_male`, `mother_age`, `plurality`, and `gestation_weeks` as well as some simple filtering and a column to hash on for repeatable splitting.\n",
    "\n",
    "* Note:  The dataset in the create table code below is the one created previously, e.g. \"babyweight\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preprocess and filter dataset\n",
    "\n",
    "We have some preprocessing and filtering we would like to do to get our data in the right format for training.\n",
    "\n",
    "Preprocessing:\n",
    "* Cast `is_male` from `BOOL` to `STRING`\n",
    "* Cast `plurality` from `INTEGER` to `STRING` where `[1, 2, 3, 4, 5]` becomes `[\"Single(1)\", \"Twins(2)\", \"Triplets(3)\", \"Quadruplets(4)\", \"Quintuplets(5)\"]`\n",
    "* Add `hashcolumn` hashing on `year` and `month`\n",
    "\n",
    "Filtering:\n",
    "* Only want data for years later than `2000`\n",
    "* Only want baby weights greater than `0`\n",
    "* Only want mothers whose age is greater than `0`\n",
    "* Only want plurality to be greater than `0`\n",
    "* Only want the number of weeks of gestation to be greater than `0`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "CREATE OR REPLACE TABLE\n",
    "    babyweight.babyweight_data AS\n",
    "SELECT\n",
    "    weight_pounds,\n",
    "    CAST(is_male AS STRING) AS is_male,\n",
    "    mother_age,\n",
    "    CASE\n",
    "        WHEN plurality = 1 THEN \"Single(1)\"\n",
    "        WHEN plurality = 2 THEN \"Twins(2)\"\n",
    "        WHEN plurality = 3 THEN \"Triplets(3)\"\n",
    "        WHEN plurality = 4 THEN \"Quadruplets(4)\"\n",
    "        WHEN plurality = 5 THEN \"Quintuplets(5)\"\n",
    "    END AS plurality,\n",
    "    gestation_weeks,\n",
    "    FARM_FINGERPRINT(\n",
    "        CONCAT(\n",
    "            CAST(year AS STRING),\n",
    "            CAST(month AS STRING)\n",
    "        )\n",
    "    ) AS hashmonth\n",
    "FROM\n",
    "    publicdata.samples.natality\n",
    "WHERE\n",
    "    year > 2000\n",
    "    AND weight_pounds > 0\n",
    "    AND mother_age > 0\n",
    "    AND plurality > 0\n",
    "    AND gestation_weeks > 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Augment dataset to simulate missing data\n",
    "\n",
    "Now we want to augment our dataset with our simulated babyweight data by setting all gender information to `Unknown` and setting plurality of all non-single births to `Multiple(2+)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "CREATE OR REPLACE TABLE\n",
    "    babyweight.babyweight_augmented_data AS\n",
    "SELECT\n",
    "    weight_pounds,\n",
    "    is_male,\n",
    "    mother_age,\n",
    "    plurality,\n",
    "    gestation_weeks,\n",
    "    hashmonth\n",
    "FROM\n",
    "    babyweight.babyweight_data\n",
    "UNION ALL\n",
    "SELECT\n",
    "    weight_pounds,\n",
    "    \"Unknown\" AS is_male,\n",
    "    mother_age,\n",
    "    CASE\n",
    "        WHEN plurality = \"Single(1)\" THEN plurality\n",
    "        ELSE \"Multiple(2+)\"\n",
    "    END AS plurality,\n",
    "    gestation_weeks,\n",
    "    hashmonth\n",
    "FROM\n",
    "    babyweight.babyweight_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Split augmented dataset into train and eval sets\n",
    "\n",
    "Using `hashmonth`, apply a module to get approximately a 75/25 train-eval split."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Split augmented dataset into train dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "CMNRractvREL"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "CREATE OR REPLACE TABLE\n",
    "    babyweight.babyweight_data_train AS\n",
    "SELECT\n",
    "    weight_pounds,\n",
    "    is_male,\n",
    "    mother_age,\n",
    "    plurality,\n",
    "    gestation_weeks\n",
    "FROM\n",
    "    babyweight.babyweight_augmented_data\n",
    "WHERE\n",
    "    ABS(MOD(hashmonth, 4)) < 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Split augmented dataset into eval dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: []\n",
       "Index: []"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "CREATE OR REPLACE TABLE\n",
    "    babyweight.babyweight_data_eval AS\n",
    "SELECT\n",
    "    weight_pounds,\n",
    "    is_male,\n",
    "    mother_age,\n",
    "    plurality,\n",
    "    gestation_weeks\n",
    "FROM\n",
    "    babyweight.babyweight_augmented_data\n",
    "WHERE\n",
    "    ABS(MOD(hashmonth, 4)) = 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "clnaaqQsXkwC"
   },
   "source": [
    "## Verify table creation\n",
    "\n",
    "Verify that you created the dataset and training data table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>weight_pounds</th>\n",
       "      <th>is_male</th>\n",
       "      <th>mother_age</th>\n",
       "      <th>plurality</th>\n",
       "      <th>gestation_weeks</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [weight_pounds, is_male, mother_age, plurality, gestation_weeks]\n",
       "Index: []"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "-- LIMIT 0 is a free query; this allows us to check that the table exists.\n",
    "SELECT * FROM babyweight.babyweight_data_train\n",
    "LIMIT 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>weight_pounds</th>\n",
       "      <th>is_male</th>\n",
       "      <th>mother_age</th>\n",
       "      <th>plurality</th>\n",
       "      <th>gestation_weeks</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [weight_pounds, is_male, mother_age, plurality, gestation_weeks]\n",
       "Index: []"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%bigquery\n",
    "-- LIMIT 0 is a free query; this allows us to check that the table exists.\n",
    "SELECT * FROM babyweight.babyweight_data_eval\n",
    "LIMIT 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export from BigQuery to CSVs in GCS\n",
    "\n",
    "Use BigQuery Python API to export our train and eval tables to Google Cloud Storage in the CSV format to be used later for TensorFlow/Keras training. We'll want to use the dataset we've been using above as well as repeat the process for both training and evaluation data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Exported qwiklabs-gcp-4b437f7e5bfff9dd:babyweight.babyweight_data_train to gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/data/train*.csv\n",
      "Exported qwiklabs-gcp-4b437f7e5bfff9dd:babyweight.babyweight_data_eval to gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/data/eval*.csv\n"
     ]
    }
   ],
   "source": [
    "# Construct a BigQuery client object.\n",
    "client = bigquery.Client()\n",
    "\n",
    "dataset_name = \"babyweight\"\n",
    "\n",
    "# Create dataset reference object\n",
    "dataset_ref = client.dataset(\n",
    "    dataset_id=dataset_name, project=client.project)\n",
    "\n",
    "# Export both train and eval tables\n",
    "for step in [\"train\", \"eval\"]:\n",
    "    destination_uri = os.path.join(\n",
    "        \"gs://\", BUCKET, dataset_name, \"data\", \"{}*.csv\".format(step))\n",
    "    table_name = \"babyweight_data_{}\".format(step)\n",
    "    table_ref = dataset_ref.table(table_name)\n",
    "    extract_job = client.extract_table(\n",
    "        table_ref,\n",
    "        destination_uri,\n",
    "        # Location must match that of the source table.\n",
    "        location=\"US\",\n",
    "    )  # API request\n",
    "    extract_job.result()  # Waits for job to complete.\n",
    "\n",
    "    print(\"Exported {}:{}.{} to {}\".format(\n",
    "        client.project, dataset_name, table_name, destination_uri))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Verify CSV creation\n",
    "\n",
    "Verify that we correctly created the CSV files in our bucket."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/data/eval000000000000.csv\n",
      "gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/data/eval000000000001.csv\n",
      "gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/data/train000000000000.csv\n",
      "gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/data/train000000000001.csv\n",
      "gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/data/train000000000002.csv\n",
      "gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/data/train000000000003.csv\n",
      "gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/data/train000000000004.csv\n",
      "gs://qwiklabs-gcp-4b437f7e5bfff9dd/babyweight/data/train000000000005.csv\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "gcloud storage ls gs://${BUCKET}/babyweight/data/*.csv"   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check data exists\n",
    "\n",
    "Verify that you previously created CSV files we'll be using for training and evaluation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "gcloud storage ls gs://${BUCKET}/babyweight/data/*000000000000.csv"   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training on Cloud AI Platform\n",
    "\n",
    "Now that we see everything is working locally, it's time to train on the cloud! "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To submit to the Cloud we use [`gcloud ai-platform jobs submit training [jobname]`](https://cloud.google.com/sdk/gcloud/reference/ml-engine/jobs/submit/training) and simply specify some additional parameters for AI Platform Training Service:\n",
    "- jobname: A unique identifier for the Cloud job. We usually append system time to ensure uniqueness\n",
    "- job-dir: A GCS location to upload the Python package to\n",
    "- runtime-version: Version of TF to use.\n",
    "- python-version: Version of Python to use. Currently only Python 3.7 is supported for TF 2.3.\n",
    "- region: Cloud region to train in. See [here](https://cloud.google.com/ml-engine/docs/tensorflow/regions) for supported AI Platform Training Service regions\n",
    "\n",
    "Below the `-- \\` we add in the arguments for our `task.py` file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "OUTDIR=gs://${BUCKET}/babyweight/trained_model\n",
    "JOBID=babyweight_$(date -u +%y%m%d_%H%M%S)\n",
    "\n",
    "gcloud ai-platform jobs submit training ${JOBID} \\\n",
    "    --region=${REGION} \\\n",
    "    --module-name=trainer.task \\\n",
    "    --package-path=$(pwd)/babyweight/trainer \\\n",
    "    --job-dir=${OUTDIR} \\\n",
    "    --staging-bucket=gs://${BUCKET} \\\n",
    "    --master-machine-type=n1-standard-8 \\\n",
    "    --scale-tier=CUSTOM \\\n",
    "    --runtime-version=${TFVERSION} \\\n",
    "    --python-version=${PYTHONVERSION} \\\n",
    "    -- \\\n",
    "    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \\\n",
    "    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \\\n",
    "    --output_dir=${OUTDIR} \\\n",
    "    --num_epochs=10 \\\n",
    "    --train_examples=10000 \\\n",
    "    --eval_steps=100 \\\n",
    "    --batch_size=32 \\\n",
    "    --nembeds=8"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training job should complete within 15 to 20 minutes. To see the progress, in your Cloud Console, click on Navigation menu > AI Platform > Jobs. You do not need to wait for this training job to finish before moving forward in the notebook, but will need a trained model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check our trained model files\n",
    "\n",
    "Let's check the directory structure of our outputs of our trained model in folder we exported. We'll want to deploy the saved_model.pb within the timestamped directory as well as the variable values in the variables folder. Therefore, we need the path of the timestamped directory so that everything within it can be found by Cloud AI Platform's model deployment service."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gs://michaelabel-gcp-training-ml/babyweight/trained_model/\n",
      "gs://michaelabel-gcp-training-ml/babyweight/trained_model/20200428154223/\n",
      "gs://michaelabel-gcp-training-ml/babyweight/trained_model/20200428221424/\n",
      "gs://michaelabel-gcp-training-ml/babyweight/trained_model/20200428230409/\n",
      "gs://michaelabel-gcp-training-ml/babyweight/trained_model/checkpoints/\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "gcloud storage ls gs://${BUCKET}/babyweight/trained_model"   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "gs://michaelabel-gcp-training-ml/babyweight/trained_model/20200428230409/\n",
      "gs://michaelabel-gcp-training-ml/babyweight/trained_model/20200428230409/saved_model.pb\n",
      "gs://michaelabel-gcp-training-ml/babyweight/trained_model/20200428230409/assets/\n",
      "gs://michaelabel-gcp-training-ml/babyweight/trained_model/20200428230409/variables/\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "MODEL_LOCATION=$(gcloud storage ls -- gs://${BUCKET}/babyweight/trained_model/2* | grep '/$' \\\n",    "                 | tail -1)\n",
    "gcloud storage ls ${MODEL_LOCATION}"   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deploy trained model\n",
    "\n",
    "Deploying the trained model to act as a REST web service is a simple gcloud call."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Updated property [ai_platform/region].\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "gcloud config set ai_platform/region global\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Deleting and deploying babyweight ml_on_gcp from gs://michaelabel-gcp-training-ml/babyweight/trained_model/20200428230409/\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: Using endpoint [https://ml.googleapis.com/]\n",
      "Created ml engine model [projects/michaelabel-gcp-training/models/babyweight].\n",
      "WARNING: Using endpoint [https://ml.googleapis.com/]\n",
      "Creating version (this might take a few minutes)......\n",
      ".............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................done.\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "MODEL_NAME=\"babyweight\"\n",
    "MODEL_VERSION=\"ml_on_gcp\"\n",
    "MODEL_LOCATION=$(gcloud storage ls -- gs://${BUCKET}/babyweight/trained_model/2* | grep '/$' \\\n",    "                 | tail -1 | tr -d '[:space:]')\n",
    "echo \"Deleting and deploying $MODEL_NAME $MODEL_VERSION from $MODEL_LOCATION\"\n",
    "# gcloud ai-platform versions delete ${MODEL_VERSION} --model ${MODEL_NAME}\n",
    "# gcloud ai-platform models delete ${MODEL_NAME}\n",
    "gcloud ai-platform models create ${MODEL_NAME} --regions ${REGION}\n",
    "gcloud ai-platform versions create ${MODEL_VERSION} \\\n",
    "    --model=${MODEL_NAME} \\\n",
    "    --origin=${MODEL_LOCATION} \\\n",
    "    --runtime-version=2.3 \\\n",
    "    --python-version=3.7"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model and version creation should complete within few minutes.\n",
    "\n",
    "**Note**: If you are unable to access the generated endpoint link, please ignore it. To see the progress, in your Cloud Console, click on Navigation menu > AI Platform > Models."
   ]
  },   
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2023 Google Inc. Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
